A Partition-Based Efficient Algorithm for Large Scale Multiple-Strings Matching

نویسندگان

  • Ping Liu
  • Yanbing Liu
  • Jianlong Tan
چکیده

Filtering procedure plays an important role in the Internet security and information retrieval fields, and usually employs multiple-strings matching algorithm as its key part. All the classical matching algorithms, however, perform poorly when the number of the keywords exceeds 5000, which made large scale multiple-strings matching problem a great challenge. Based on the observation that the speed of the classical algorithms depends on the minimal length of the keywords, a partition strategy was proposed to decompose the keywords set into a series of subsets on which the classical algorithms was performed. In the optimal partition, it was proved that the keywords with same length would be separated into one subset, and length of keywords in different subsets would not interlace each other. In this paper, we proposed a shortest-path model for the optimal partition problem. Experiments on both random dataset and ClamAV dataset demonstrated our algorithms works much better than the classical ones.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Partition-Based Efficient Algorithm for Large Scale Multiple-Strings Matching

Filtering plays an important role in the Internet security and information retrieval fields, and usually employs multiple-strings matching algorithm as its key part. All the classical matching algorithms, however, perform badly when the number of the keywords exceeds a critical point, which made large scale multiple-strings matching problem a great challenge. Based on the observation that the s...

متن کامل

A partition-based algorithm for clustering large-scale software systems

Clustering techniques are used to extract the structure of software for understanding, maintaining, and refactoring. In the literature, most of the proposed approaches for software clustering are divided into hierarchical algorithms and search-based techniques. In the former, clustering is a process of merging (splitting) similar (non-similar) clusters. These techniques suffered from the drawba...

متن کامل

An employee transporting problem

An employee transporting problem is described and a set partitioning model is developed. An investigation of the model leads to a knapsack problem as a surrogate problem. Finding a partition corresponding to the knapsack problem provides a solution to the problem. An exact algorithm is proposed to obtain a partition (subset-vehicle combination) corresponding to the knapsack solution. It require...

متن کامل

DISCRETE AND CONTINUOUS SIZING OPTIMIZATION OF LARGE-SCALE TRUSS STRUCTURES USING DE-MEDT ALGORITHM

Design optimization of structures with discrete and continuous search spaces is a complex optimization problem with lots of local optima. Metaheuristic optimization algorithms, due to not requiring gradient information of the objective function, are efficient tools for solving these problems at a reasonable computational time. In this paper, the Doppler Effect-Mean Euclidian Distance Threshold ...

متن کامل

Simple and Efficient Algorithm for Approximate Dictionary Matching

This paper presents a simple and efficient algorithm for approximate dictionary matching designed for similarity measures such as cosine, Dice, Jaccard, and overlap coefficients. We propose this algorithm, called CPMerge, for the τ overlap join of inverted lists. First we show that this task is solvable exactly by a τ -overlap join. Given inverted lists retrieved for a query, the algorithm coll...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005